Optimizing Shuffle Performance in Spark
Authors
Abstract
Spark [6] is a cluster computing framework that performs in-memory computation, with the goal of outperforming disk-based engines such as Hadoop [2]. As in other distributed data processing platforms, data is commonly redistributed in a many-to-many fashion, in a stage traditionally known as the shuffle phase. Spark's shuffle phase contains several sources of inefficiency that, once addressed, promise substantial performance improvements. In this paper, we identify the bottlenecks in the current design and propose alternatives that solve the observed problems. We evaluate our results in terms of application-level throughput.
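As background, the many-to-many data movement of a shuffle can be sketched in plain Python. This is a hypothetical, Spark-free illustration of hash partitioning; the names `map_side_partition` and `shuffle` are invented here and are not Spark APIs:

```python
from collections import defaultdict

def map_side_partition(records, num_reducers):
    """Bucket one mapper's (key, value) output by hash of the key,
    so all values for a key target the same reducer."""
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for key, value in records:
        buckets[hash(key) % num_reducers][key].append(value)
    return buckets

def shuffle(mapper_outputs, num_reducers):
    """Each reducer conceptually fetches its bucket from every mapper:
    a many-to-many transfer across the cluster."""
    partitioned = [map_side_partition(recs, num_reducers)
                   for recs in mapper_outputs]
    merged = [defaultdict(list) for _ in range(num_reducers)]
    for buckets in partitioned:          # one fetch per (mapper, reducer) pair
        for r in range(num_reducers):
            for key, values in buckets[r].items():
                merged[r][key].extend(values)
    return merged

# Two mappers emit key-value pairs; after the shuffle, all values for a
# given key reside in exactly one reducer partition.
mappers = [[("a", 1), ("b", 2)], [("a", 3), ("c", 4)]]
partitions = shuffle(mappers, 2)
```

In a real cluster each (mapper, reducer) pair implies a network fetch, which is why the number and size of these transfers dominate shuffle cost.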
Similar Resources
Scaling Spark on Lustre
We report our experiences in porting and tuning the Apache Spark data analytics framework on the Cray XC30 (Edison) and XC40 (Cori) systems, installed at NERSC. We find that design decisions made in the development of Spark are based on the assumption that Spark is constrained primarily by network latency, and that disk I/O is comparatively cheap. These assumptions are not valid on Edison or Co...
Clash of the Titans: MapReduce vs. Spark for Large Scale Data Analytics
MapReduce and Spark are two very popular open source cluster computing frameworks for large scale data analytics. These frameworks hide the complexity of task parallelism and fault-tolerance, by exposing a simple programming API to users. In this paper, we evaluate the major architectural components in MapReduce and Spark frameworks including: shuffle, execution model, and caching, by using a s...
An Effective High-Performance Multiway Spatial Join Algorithm with Spark
Multiway spatial join plays an important role in GIS (Geographic Information Systems) and their applications. With the increase in spatial data volumes, the performance of multiway spatial join has encountered a computation bottleneck in the context of big data. Parallel or distributed computing platforms, such as MapReduce and Spark, are promising for resolving the intensive computing issue. P...
MapReduce with communication overlap (MaRCO)
MapReduce is a programming model from Google for cluster-based computing in domains such as search engines, machine learning, and data mining. MapReduce provides automatic data management and fault tolerance to improve programmability of clusters. MapReduce’s execution model includes an all-map-to-all-reduce communication, called the shuffle, across the network bisection. Some MapReductions mov...
A Workload-Specific Memory Capacity Configuration Approach for In-Memory Data Analytic Platforms
Nowadays, in-memory data analytic platforms, such as Spark, are widely adopted in big data processing. The proper memory capacity configuration has been proved to be an efficient way to guarantee the workload performance in such platforms.Currently, Spark adopts the static way to configure the memory capacity for workloads based on user specifications. However, due to the lack of deep knowledge...